The authors report on the reliability and validity of peer assessment of writing in high
school Advanced Placement English classrooms. Students used a task-specific rubric 
to anonymously assess their classmates' writing. 


Current literacy policies stress the need for high
school students to develop argumentative writing
skills in order to be prepared for college, for
many careers, and for being critically engaged citizens
(Newell, VanDerHeide, & Olsen, 2014). However, recent
National Assessment of Educational Progress results
suggest that many secondary students struggle with
analytical, argument-driven writing (National Center for
Education Statistics, 2012). One reason for this may be
that high school students are provided with few
opportunities to write and receive feedback on their writing
(Applebee & Langer, 2013; Kiuhara, Graham, & Hawken,
2009). Surveys of teachers have shown that one reason
teachers do not assign significant amounts of writing is
the amount of time needed to grade and provide
feedback on it (Gilbert & Graham, 2010; Kiuhara et al.,
2009).

One approach to writing instruction that has been
shown to improve secondary and college students’
academic writing across disciplines, school settings,
and grade levels without increasing the demands on
teachers’ time is peer review (Cho & MacArthur, 2011;
Simmons, 2003). Although there are various ways that
English language arts and other subject area teachers
design such peer review activities, students typically
read their peers’ essays and provide feedback in
response to teacher-generated questions or prompts.
Writers then use the feedback to revise and ideally
improve their essays before submitting them to the teacher
for a final grade.

However, many teachers and students worry that students’
feedback and assessment of their peers’ writing
are less accurate than teachers’ (Gielen, Peeters, Dochy,
Onghena, & Struyven, 2010; Hovardas, Tsivitanidou, &
Zacharia, 2014; Kaufman & Schunn, 2011). Two studies
of adolescents’ perceptions of peer review of writing
found that students perceived teacher feedback as more
accurate and valuable than that of their peers even when
the students used their peers’ feedback to revise their
essays (Gielen et al., 2010; Hovardas et al., 2014).

This study investigated whether high school
Advanced Placement (AP) students in diverse school settings
can accurately assess their peers’ writing if given a
carefully designed rubric to guide their assessment and
feedback. We first explain the logic and method behind
the construction of our rubric, a student-friendly version
of the College Board’s (2013) rubric for argument-driven
AP English Language and Composition essays. We then
examine the reliability and validity of the students’
assessments by comparing them with their teachers’ and
trained AP scorers’ assessments. Finally, we briefly
discuss students’ and teachers’ perceptions of the peer
review process.

Research on Peer Review 

Peer review has been a common instructional approach
in middle and high school English classes since the 1980s
and is considered a key element of process approaches
to writing (Atwell, 1987; Graham & Perin, 2007; Hillocks,
1984). However, little research on the validity and
reliability of high school students’ assessments of their
peers’ writing has been conducted (Brookhart & Chen,
2015).

At the secondary level, the design of peer review tasks
and their effectiveness can vary greatly by classroom.
In the absence of specific guidance, such as scoring
rubrics and targeted prompts for qualitative feedback,
students often provide their peers with empty praise
(Tsivitanidou, Zacharia, & Hovardas, 2011; VanDeWeghe,
2004), avoid critiques of their peers’ work because
of fear of negative social consequences (Freedman,
1992; VanDeWeghe, 2004), or focus only on editing
sentence-level errors (Freedman, 1992; Simmons, 2003).
Additionally, peer review guidelines that ask students to
provide yes/no answers or answer low-level questions
(e.g., “Does the essay have three paragraphs?”) lead to
low-quality peer feedback that is not useful to writers
(Goldberg, Roswell, & Michaels, 1995).

Two of the features essential to productive peer review
are holding students responsible for taking the task
of providing feedback seriously and lessening adolescents’
fear of social repercussions of honest peer review.
High school students frequently ignore written guidelines
provided by their teachers in their peer reviews
and assessments when the reviewing process has little
teacher oversight or accountability (Freedman, 1992;
Peterson, 2003).

Additionally, in face-to-face peer review, students
are often more concerned about social dynamics, such
as whether they will be perceived as mean or will
embarrass their peers, than about providing accurate peer
feedback (Christianakis, 2010; Peterson, 2003). The online peer
feedback system that was used for this study addressed
these concerns in two ways: by making the quality of the
students’ peer reviews part of their grade for the task
and by offering a double-blind peer review process in
which both authors and reviewers are anonymous.

Moreover, peer review of writing is most effective
when guidelines for assessing and providing feedback
require students to base their assessments on specific,
understandable criteria and offer detailed suggestions
for improvement (Gan & Hattie, 2014). Students also gain
more from peer review when they are guided to focus
more on global issues in writing, such as ideas and
evidence, and when the activity develops students’
awareness of audiences other than the teacher (Freedman,
1992; Simmons, 2003).

Studies at the college level have demonstrated that
when students are guided by a clear rubric and held
accountable for the quality of their peer feedback, their
assessments of their peers’ writing have strong reliability
and validity (Cho, Schunn, & Wilson, 2006; Panadero,
Romero, & Strijbos, 2013). We use the term reliability
to refer to the extent to which students’ assessments of
peers’ writing correlate with each other (i.e., inter-rater
reliability). We use validity to describe the extent to
which students can accurately judge what they are asked
to assess in one another’s writing, such as the quality of
a thesis.

However, few studies of the validity and reliability of
secondary students’ peer assessments exist, and none
to our knowledge focus on writing in English language
arts. In Sadler and Good’s (2006) study of seventh-grade
science students’ assessments of their peers’ work (a
task that included both multiple-choice and open-ended
responses), the researchers found high correlations
between the grades assigned by teachers and peers.

Other studies conducted in secondary computer science
classes have shown similar results (Sung, Chang,
Chiou, & Hou, 2005; Tseng & Tsai, 2007). Conversely,
still other studies have found that secondary students’
peer and self-assessments of writing are not consistent
with teacher assessments (Chang, Tseng, Chou, & Chen,
2011; Varner, Roscoe, & McNamara, 2013). Our study
adds to this body of research by focusing specifically on
the reliability and validity of peer assessment of writing
in English language arts when students are given a
high-quality rubric and incentives for providing helpful
feedback and accurate assessments.

Methods 

Participants 

Twenty-eight AP English Language and Composition 
teachers from 26 different schools located in 12 states 
across the United States took part in the study. Schools 
were primarily located in suburban areas (57%) but also 
included urban (25%) and rural (18%) areas. Most schools 
were traditional public (71%), but others were Catholic 
(11%), independent (11%), or charter schools (7%). 

Based on academic performance data available online
(e.g., ACT or SAT scores), 68% of the schools were
high performing (with school means above national
averages), and 32% were low performing (with school
means below national averages). A number of schools
also had high proportions of historically underserved
students of color and/or students eligible for free or
reduced-price lunch.

The teachers were recruited via e-mails sent
through the College Board and the National Writing
Project. All but two teachers had previously used in-class
peer assessment, but only four had used online
peer assessment (e.g., through Turnitin: turnitin.com).
All but one teacher had given prior AP exams as practice
earlier in the year, and most teachers had given more
than four. Teachers participated in a one-hour online
training session to learn about the online peer review
system Peerceptiv (www.peerceptiv.com) and the structure
of the study. Teachers also participated in an “act as
a student” exercise in Peerceptiv to experience the system
through the eyes of students.

A total of 1,215 students participated in the study. The
largest number of students per teacher was 134 (in multiple
class sections), and the smallest was 13.

Implementation of the Peer Review System

Peer assessment can be implemented through many 
different methods, from face-to-face discussions of 
drafts to Web-based distribution and rating methods. 
Peerceptiv is a Web-based system in which authors and 
reviewers are anonymous so students feel comfortable 
being more honest in their feedback. It has been used by 
over 50,000 students around the world, and currently 
approximately half of its users are high school students. 
Links for accessing Peerceptiv and other online review 
systems are shared in the More to Explore sidebar at the 
end of this article. 

Participating teachers assigned a peer review activity
using the Peerceptiv system to at least one class section
of students between early April and early May. Before
starting the peer review process, the teachers presented
their students with a 30-45-minute lesson, provided
by the researchers, on high-quality peer feedback (see
Figure 1). The peer review process followed these steps:

1. Students uploaded their essays into the system. 

2. The system automatically distributed the essays so 
each student received five peer essays. 

3. Students used the rubrics developed by the researchers
(described in the Description of Rubric Design
section) to assess their peers’ essays. The rubrics
include both open-ended prompts (e.g., “Provide feedback
on how well the author explained the textual
evidence he or she provided”) and numerical ratings
(e.g., on a scale of 1-7, “How strong is the evidence for
each claim about Louv’s rhetorical strategies?”).

Figure 1

Researcher-Designed Lesson on Good Peer Review

1. Explain to students, Why are we doing peer review?

• Writing a sample AP essay will give you practice for the test.

• Multiple peer reviews will help you see strengths and areas
for improvement in your essay.

• Reviewing other students’ papers will give you new ideas
about AP essays.

2. Ask students to think about some of the feedback they have
received and given.

• What kind of feedback has been most helpful?

• What kind of feedback has been least helpful?

3. Whole-class discussion of examples of feedback:

• Read the sample AP prompt.

• Read the sample AP essay.

• Read two different samples of peer feedback on the
same essay.

• Discuss what makes each type of peer feedback
helpful (or not).

4. Sum up the discussion. Emphasize that helpful feedback:

• Points out weaknesses or areas for improvement in the essay.

• Tells the author WHERE in the paper the problem or
weakness is.

• Provides SPECIFIC suggestions to help the author improve
his or her work.

• Note that:

• General praise (aka “strong essay”) is not helpful feedback.

• Point out areas for improvement but be kind and respectful.

5. Practice with a partner (or individually) giving feedback on a
different sample AP essay.

6. Discuss in pairs, small groups, or whole class about the
feedback you gave to the essay.

7. Introduce Peerceptiv and how it works.

8. Introduce the revised AP rubric and check for student
understanding of it.

Note. AP = Advanced Placement.

4. After all reviews were completed, authors received 
two kinds of feedback: open-ended feedback provided 
by peer reviewers and scores that reflected the mean 
(average) of the reviewers’ numerical ratings. 

5. Authors then rated the helpfulness of the comments 
they received. 

6. Peerceptiv automatically generated individual student
grades for the peer review task based on the
quality of the essays (as determined by the average
of peers’ ratings), the quality of the peer reviews, and
the on-time completion of all aspects of the task.

7. Students used peer feedback to revise their essays. 

The quality of students’ peer reviews is calculated by
Peerceptiv in two ways: by authors’ ratings of the
helpfulness of their open-ended feedback comments (step 5)
and by the accuracy of the reviewer’s numerical ratings
of peers’ essays (step 3). The accuracy of the reviewer’s
ratings is determined by comparing his or her ratings
with the mean ratings produced by other peers on those
same essays. The closer the reviewer is to the mean
ratings produced by other students (across rating criteria
and essays), the higher the accuracy grade. Because
students’ review grades decline the more their ratings
differ from other reviewers’, Peerceptiv provides a
strong disincentive for students to cheat the system and
give undeserved high or low grades to other students,
even if they share their pseudonyms.
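
The accuracy idea described above can be illustrated with a minimal sketch. Peerceptiv's actual formula is not given in this article, so everything here is an assumption made for illustration: the function names, the use of mean absolute deviation, the mapping onto a 0-1 grade, and the made-up ratings.

```python
from statistics import mean

def reviewer_deviation(ratings, reviewer):
    """Mean absolute gap between one reviewer's ratings and the mean of the
    other reviewers' ratings, over every essay and criterion they rated.

    ratings: {essay_id: {reviewer_id: [rating per criterion]}} (hypothetical layout)
    """
    gaps = []
    for essay_ratings in ratings.values():
        if reviewer not in essay_ratings:
            continue
        others = [r for name, r in essay_ratings.items() if name != reviewer]
        if not others:
            continue
        for i, own in enumerate(essay_ratings[reviewer]):
            peer_mean = mean(o[i] for o in others)
            gaps.append(abs(own - peer_mean))
    if not gaps:
        raise ValueError("reviewer has no ratings that can be compared")
    return mean(gaps)

def accuracy_grade(ratings, reviewer, scale_min=1, scale_max=7):
    """Map the deviation onto 0..1, where 1 means perfect agreement.
    This linear mapping is an illustrative assumption, not Peerceptiv's rule."""
    return 1 - reviewer_deviation(ratings, reviewer) / (scale_max - scale_min)

# Made-up ratings on two criteria: r1 and r2 agree, r3 rates very differently.
example = {
    "essay_a": {"r1": [6, 5], "r2": [6, 5], "r3": [2, 7]},
    "essay_b": {"r1": [3, 4], "r2": [3, 4], "r3": [7, 1]},
}

for name in ("r1", "r2", "r3"):
    print(name, round(accuracy_grade(example, name), 2))
```

In this toy data the outlying reviewer receives the lowest grade, which is the incentive the paragraph describes: ratings that drift from the peer consensus cost the reviewer points.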

Essay Prompts 

The essay prompt for this study was taken from the 
2013 AP English Language and Composition exam. It 
presented students with a one-page passage (from Last 
Child in the Woods: Saving Our Children From Nature- 
Deficit Disorder by Richard Louv, 2008) and asked them 
to analyze the rhetorical strategies used by the author 
to develop his argument, with specific references to the 
text. This particular type of essay prompt is a common 
feature of the AP English Language and Composition 
exam and reflects a core instructional goal of the class 
(i.e., the analysis of written arguments). 

Teachers were instructed to give students the suggested
40-minute time period to complete their responses.
Because some teachers chose to have their students
complete the writing outside of class, we cannot ensure
that students limited their writing time to 40 minutes;
however, the length of time students were given to write
does not affect our findings.

Essay Scoring 

Students’ assessments of their classmates’ essays were
compared with those of other students, their teachers,
and expert AP essay scorers to study their reliability and
validity. We investigated the reliability of students’ peer
assessments by examining the extent to which students’
numerical ratings correlated with one another for the
same essays (i.e., inter-rater reliability). We then
investigated the validity of students’ assessments of their peers’
essays by comparing them with teachers’ and expert AP
scorers’ assessments of the same essays. Each participating
teacher rated at least 15 of his or her students’ essays
using the researcher-designed rubric. On average,
teachers rated 21 essays, for a total of 489 rated essays.
Correlations between the student mean ratings (across
the five peer assessments per essay) and the teacher
ratings were computed separately for each teacher.
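
As a rough illustration of this per-teacher analysis, the sketch below averages the peer ratings for each essay and correlates those means with the teacher's rating, one teacher at a time. The record layout, function names, and ratings are hypothetical; the article does not describe the software used for the analysis.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def per_teacher_correlations(records):
    """records: (teacher_id, [peer ratings for one essay], teacher rating).
    Returns {teacher_id: correlation of peer means with teacher ratings}."""
    by_teacher = {}
    for teacher, peer_ratings, teacher_rating in records:
        peer_means, teacher_scores = by_teacher.setdefault(teacher, ([], []))
        peer_means.append(mean(peer_ratings))
        teacher_scores.append(teacher_rating)
    return {t: pearson(xs, ys) for t, (xs, ys) in by_teacher.items()}

# Made-up data: three essays for each of two teachers, five peer ratings each.
records = [
    ("t1", [5, 6, 5, 4, 5], 5),
    ("t1", [2, 3, 3, 2, 2], 3),
    ("t1", [6, 7, 6, 6, 7], 6),
    ("t2", [3, 3, 3, 3, 3], 4),
    ("t2", [5, 5, 5, 5, 5], 6),
    ("t2", [4, 4, 4, 4, 4], 5),
]

correlations = per_teacher_correlations(records)
mean_correlation = mean(correlations.values())
```

Computing one correlation per teacher, and only then averaging, keeps differences in how strictly individual teachers grade from being mistaken for disagreement between teachers and students.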

Although much of the research literature has used
single-teacher ratings as the typical source of validity
data, this approach has two problems. First, a
single-teacher rating has unknown reliability itself. Second,
high-stakes assessments, such as the AP tests, use multiple
expert graders for each essay, and the graders go
through careful training processes to ensure reliability.

Thus, three experienced AP graders who had been
trained by the College Board also scored a subset of 100
AP essays. The essays chosen were evenly distributed
across the teachers, sampled from the essays that both
students and teachers had evaluated. Unlike students
and teachers who used our revised rating rubric, the
expert AP graders used the traditional College Board
(2013) holistic rubric (with a scale of 1-9) to produce
overall document scores through a scoring method
closely aligning with the typical expert grading process.
Validity of the student ratings was assessed by the
correlation with expert AP ratings. Note that even though
the expert AP raters used a 9-point scale and students
and teachers used a 7-point scale (described in the
Description of Rubric Design section), mathematically,
the correlations between ratings are not influenced by
these scale differences.
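
The point about scale differences can be checked numerically. The sketch below, which uses made-up scores and a hypothetical linear mapping from a 1-7 range onto a 1-9 range, shows that a Pearson correlation is unchanged when one set of ratings is linearly rescaled. It illustrates only that the units of a scale do not matter; it does not claim that the analytic and holistic rubrics are linear transformations of each other.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rescale_7_to_9(score):
    """Linearly stretch a 1-7 score onto a 1-9 range (illustrative mapping)."""
    return 1 + (score - 1) * 8 / 6

peer_means = [5.2, 3.0, 6.4, 4.1, 2.6]   # made-up mean peer ratings, 1-7 scale
expert_scores = [6, 4, 8, 5, 3]          # made-up expert scores, 1-9 scale

r_original = pearson(peer_means, expert_scores)
r_rescaled = pearson([rescale_7_to_9(s) for s in peer_means], expert_scores)

# The two values agree up to floating-point rounding.
print(abs(r_original - r_rescaled) < 1e-9)
```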

Surveys 

At the end of the study, both students and teachers were 
sent links to an online survey about their perceptions of 
the benefits and disadvantages of the peer review task, 
and suggestions for improving the Peerceptiv system. 
The teacher survey also asked about prior experience 
with peer review, typical writing instruction, and the 
likelihood of using Peerceptiv in future years. Twenty- 
six of the 28 participating teachers (93%) and 343 (28%) 
of the students (a high rate compared with typical high 
school student survey responses of 13-20%; Porter & 
Whitcomb, 2003) completed their respective surveys. 

Description of Rubric Design 

A rubric is a set of criteria for evaluating student work
that includes descriptions of performance levels (not
just a numerical scale; Brookhart & Chen, 2015). Rubrics
are widely used in K-12 writing instruction and for
high-stakes writing assessments. Overall, rubrics have been
shown to increase student achievement and motivation
in writing, but only when the rubrics convey clear and
specific descriptions of quality levels for task-specific
criteria (Andrade, Du, & Wang, 2008; Brookhart & Chen,
2015).

However, in high-stakes writing assessments, the
language of the rubrics is often general and typically
designed to be used only after extensive training and
discussion of multiple benchmark sample essays. In the
case of the AP English Language and Composition scoring
guide, for instance, the difference between a score
of 8, 6, or 4 on the 9-point scale hinges on whether students
“effectively,” “adequately,” or “inadequately” analyze
Louv’s rhetorical strategies (for the full rubric, see
College Board, 2013). Additionally, the terminology used
in such rubrics may be unfamiliar to most high school
students. In the AP rubric, we noted phrases such as
“lapses in diction or syntax” and “mature prose style”
(p. 2) as descriptors that would likely be unclear to most
high schoolers.

The current study took on these challenges by revising
the College Board’s (2013) AP scoring guide to be
more understandable and user-friendly for students.
Our revisions were guided by four key design principles:

1. Students should be able to understand the language 
of the rubric, including writing terminology. 

2. The rubric should focus students’ attention on critical,
high-value elements of the piece of writing.

3. The rubric should include concrete and specific
descriptors or examples, not vague or comparative
qualifiers such as adequate or more sophisticated.

4. Each rubric criterion should focus on only one aspect 
of the writing task (e.g., on quality of evidence not 
quality and explanation of evidence). 

Following these principles, we first changed the holistic
rubric to an analytic rubric. Holistic rubrics provide
a single score based on an overall impression of
student work, whereas analytic rubrics provide separate
feedback on each important characteristic of the
task. This change focused student reviewers’ attention on the
most important characteristics of the essay rather than
on determining an overall score. We identified eight
essential characteristics or assessment criteria in the
College Board (2013) rubric and in the scored essays on
the College Board website
(apcentral.collegeboard.com/apc/members/exam/exam_information/200i.html):

1. Thesis 

2. Explanation of Louv’s argument 


3. Analysis of Louv’s rhetorical strategies 

4. Evidence for claims 

5. Explanation of evidence 

6. Organization 

7. Control of language 

8. Conventions 

Next, we generated more concrete and specific
descriptions for the performance levels for each criterion.
In some places, we substituted phrases (e.g., “analyzes
multiple, subtle rhetorical strategies” instead of
“analyzes effectively”), and in other places, we added
examples. In Figure 2, we share a portion of the original
College Board (2013) scoring guide for writers’ analysis
of Louv’s argument and a corresponding example from
our revised rubric (the complete rubric is available at
www.lrdc.pitt.edu/schunn/sword/jaalrubrics.docx). Our
rubric is distinct from the College Board’s scoring guide
in a number of ways.

First, we separated two elements of the original rubric,
explanation of Louv’s argument (not shown in
Figure 2) and analysis of Louv’s rhetorical strategies, so
students could separately assess writers’ understanding
of Louv’s argument and their analysis of his use of
rhetorical strategies. Second, we added explicit descriptions
and examples of subtle and more obvious rhetorical
strategies to clarify the characteristics of a high-scoring
analysis of Louv’s rhetorical strategies. Similarly, for the
lowest possible score, we added descriptions of essay
topics that might seem related to the essay prompt but
did not address the prompt directly.

Third, we quantified the number of rhetorical strategies
analyzed by students that the College Board scorers
seemed to expect for each rating on the scoring guide.
Finally, because Peerceptiv uses a 7-point scale for peer
assessments that is nonadjustable, we converted the AP
rubric’s 9-point scale to a 7-point scale, providing descriptions
of performance levels for scores of 7, 5, 3, and 1 and leaving the
intermediate scoring levels (6, 4, and 2) without descriptions
so as not to overwhelm students.

In addition, open-ended commenting prompts were
created for each rubric criterion so students could provide
narrative feedback to one another. The only exception
was that a combined commenting prompt was used
for academic vocabulary and conventions. An early version
of the rubric was piloted with three students currently
enrolled in AP English Language and Composition
courses, and then revised based on their feedback. This
revised version was then tested on the teachers as part
of their training and further revised to generate the final
version of the rubric.

Figure 2

Comparison of College Board^a and Researcher-Created Advanced Placement (AP) Rubric Criteria for
Analyzing Louv's^b Rhetorical Strategies

College Board AP scoring guide

No open-ended feedback

Ratings:

9

“8 – Effective
Essays earning a score of 8 effectively analyze* the rhetorical
strategies Louv uses to develop his argument about the separation
between people and nature.”

“* For the purposes of scoring, analysis refers to explaining how the
author's rhetorical choices develop meaning or achieve a particular
effect or purpose.”

7

“6 – Adequate
Essays earning a score of 6 adequately analyze the rhetorical
strategies Louv uses to develop his argument about the separation
between people and nature.”

5

“4 – Inadequate
Essays earning a score of 4 inadequately analyze the rhetorical
strategies Louv uses to develop his argument about the separation
between people and nature. These essays may misunderstand the
passage, misrepresent the strategies Louv uses, or may analyze
these strategies insufficiently.”

3

“2 – Little Success
Essays earning a score of 2 demonstrate little success in analyzing
the rhetorical strategies Louv uses to develop his argument about the
separation between people and nature. These essays may
misunderstand the prompt, misread the passage, fail to analyze the
strategies Louv uses, or substitute a simpler task by responding to the
prompt tangentially with unrelated, inaccurate, or inappropriate
explanation.”

1

Revised AP rubric

Open-ended feedback:
Provide feedback on how well the author analyzed how Louv's
rhetorical strategies support his argument throughout the essay. Be
specific about how the writer could improve his or her thesis
analysis, and provide suggestions for improvements.

Ratings:
What rhetorical strategies did the author analyze in his or her
essay?

7 – The author accurately analyzes multiple, subtle rhetorical
strategies that Louv uses (e.g., appealing to a common cause,
evoking nostalgia, or other sophisticated strategies).

6

5 – The author analyzes three or more obvious rhetorical strategies
that Louv uses (e.g., using rhetorical questions, anecdotes, or other
obvious strategies).

4

3 – The author analyzes only one or two obvious rhetorical
strategies that Louv uses (e.g., rhetorical questions) or
misunderstands Louv's strategies.

2

1 – The author didn't write about Louv's rhetorical strategies (instead
discussed a different topic, connected to personal experience, or
just summarized Louv's piece).

Note. ^a College Board. (2013). AP® English Language and Composition 2013 scoring guidelines. New York, NY: Author. ^b Louv, R. (2008). Last child in the woods:
Saving our children from nature-deficit disorder. Chapel Hill, NC: Algonquin.

Findings 

Reliability of Student Ratings 

Our first research questions were, Did the students tend 
to agree with one another at sufficiently high levels to 
produce reliable scores? Did particular characteristics 
of writing, such as more complex ones like reasoning 
and warranting, lead to more disagreement in students’ 
scores? 

Peerceptiv automatically calculates the reliability
of the mean peer quantitative ratings for each essay
using an intraclass correlation (ICC) and provides this
information to teachers. This measure examines how
consistent each student’s pattern of ratings is with
the ratings produced by the other students to determine
the stability or trustworthiness of the resulting mean
rating across students for each document. If the ICC is
high, then assigning a given document to another group
of students would likely produce very similar ratings; if the
ICC is low, then another group of students could often
produce a different rating.
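
To make the idea of an ICC for mean ratings concrete, the following sketch computes a one-way random-effects ICC from a small table of made-up ratings. The article does not state which ICC variant Peerceptiv uses, so the one-way model (chosen here because each essay is rated by a different set of peers), the function name, and the data are all assumptions.

```python
from statistics import mean

def icc_one_way(table):
    """One-way random-effects intraclass correlations.

    table: list of essays, each a list of k ratings from different raters.
    Returns (ICC for a single rating, ICC for the mean of k ratings).
    """
    n, k = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    row_means = [mean(row) for row in table]
    # Between-essay and within-essay mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(table, row_means) for x in row
    ) / (n * (k - 1))
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

# Made-up ratings: four essays, each rated by five peers on a 1-7 scale.
ratings = [
    [6, 7, 6, 5, 6],
    [3, 2, 3, 4, 3],
    [5, 5, 4, 6, 5],
    [2, 1, 2, 2, 3],
]

icc_single, icc_mean = icc_one_way(ratings)
print(round(icc_single, 2), round(icc_mean, 2))
```

The ICC for the mean of five ratings is higher than the ICC for a single rating, which is why averaging across several peer reviewers yields a more stable score than any one reviewer provides.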

Figure 3 shows these ICC values for each rubric criterion.
All criteria except conventions show acceptable
reliability, that is, ICC values above .40 (Fleiss, 1986), which
is a common level of agreement among teachers scoring
a set of documents (Cho et al., 2006). The low ICC for the
conventions criterion shows that students seem to
disagree considerably about whether an essay generally follows the
conventions of standard written English. The first two
criteria, quality of thesis and analysis of rhetorical
strategies, have the highest reliability.

Figure 3

Mean Reliability (and Standard Error Bars) for
Mean Peer Ratings on Each Rubric Criterion

[Bar chart not reproduced. Rating dimensions, in order: Thesis; Rhetorical strategies; Louv's arg; Evidence for claims; Explaining evidence; Organization; Control of language; Conventions.]

Thus, students seem to be able to reliably judge the
higher level or more complex aspects of essay quality
even more than lower level features, such as mechanics.
Additionally, there were no statistically significant
differences in reliability across higher versus lower
performing schools on any of the rubric criteria, suggesting
that students from diverse school contexts can reliably
and accurately assess their peers’ writing.

Validity of Student Ratings Against 
Teacher Ratings and Expert AP Ratings 

Our next research question was, To what extent can
students accurately judge the quality of their
peers’ essays? Accuracy was determined by two sets
of experts (AP English teachers and trained AP scorers).
To answer this question, student ratings were compared
with both teacher ratings and expert AP scorers’ ratings
to analyze the validity of the student ratings. As noted
previously, each teacher rated at least 15 of their students’
essays, and some teachers rated over 35 essays.

Correlations between teacher and mean student ratings
for each paper were computed for each rating criterion
to examine whether some criteria were judged
more accurately than others. In addition, the correlations
were calculated separately for each teacher to
examine whether the alignment between teacher and
student ratings varied by context. Mean correlations
are presented in Figure 4. Correlations were generally
between .4 and .6 across rubric criteria and almost .7 for the
overall essay rating.

Figure 4

Mean Correlation (and Standard Error Bars) Across
Classes of Mean Student Ratings With Teacher
Ratings

[Bar chart not reproduced. Rating dimensions, in order: Thesis; Rhetorical strategies; Louv's arg; Evidence for claims; Explaining evidence; Organization; Control of language; Conventions; Total Score.]

These correlations are higher than what is typically
seen between two teachers’ ratings on a set of essays (Cho
et al., 2006), suggesting that the accuracy of students’
ratings of their peers’ essays, when calculated by taking
the average of five students’ ratings, is very high. Finally,
the student versus teacher correlations were very similar
across higher and lower performing schools.

Figure 5 presents the correlations of the mean overall
essay scores of students (averaged across student
raters and rubrics) and teacher scores (averaged across
rubrics) against the expert AP ratings. Pearson correlations
are used as the primary metric because the ratings
are not all on the same scale (i.e., student and teacher
ratings are on the Peerceptiv 1-7 scale rather than the
AP 1-9 scale).

There are a number of important outcomes to note
from the results presented in Figure 5. First, the
correlations between student mean scores and expert AP
scores and between teacher and expert AP scores were
both acceptably high (at or above .5) and relatively similar.
Despite the complexity of rating the AP essays even for
expert AP raters, this study revealed that both students
and teachers could produce useful ratings when provided
with a carefully designed rubric. Second, the mean of
student ratings correlated with experts’ slightly more
highly than did the teacher ratings. That is, the mean of
five student ratings appears to be even more valid than
single-teacher ratings. This suggests that if multiple students
assess a peer’s essay using a well-designed rubric,
the average of the students’ ratings could potentially be
used in place of a teacher-generated grade.

Figure 5

Correlations With Advanced Placement Expert
Ratings for Mean Student Ratings and Individual
Teacher Ratings

[Bar chart not reproduced. Bars: Student mean; Teacher.]

Student and Teacher Perceptions of Peer Review

Our final research question was, What were students’
and teachers’ perceptions of the peer review process?
Overall, both students and teachers thought that using
Peerceptiv for peer review was beneficial. A majority
of students agreed with each of the possible benefits of
peer review on our survey (see Figure 6). Although students
generally thought that they had received good advice
from their peers, the strongest perceived benefits
involved seeing successful strategies and weaknesses
in other students’ essays. The perception that assessing
peers’ writing is more beneficial than receiving peer
assessments is consistent with the findings of previous
studies (Godley, Loretto, & DeMartino, 2014).

Figure 6

Percentage of Students Agreeing or Strongly
Agreeing With Each Possible Benefit of Peer
Assessment of Writing (With Standard Error Bars)

The survey also provided insights into student concerns
(see Figure 7). The only concerns endorsed by a
majority of students were that they did not like having
grades based on peer assessments, and they were concerned
about the workload, which mirrors the findings
of previous research (Kaufman & Schunn, 2011).
Regarding workload, most comments suggested that
four peer assessments would be reasonable rather than
five. Also, only a minority of students preferred teacher
feedback or face-to-face (rather than online) peer
feedback, suggesting that student buy-in for using peer
feedback instead of teacher feedback is strong, even if
students question whether their peers’ assessment of
their work is accurate.

Figure 7

Percentage of Students Agreeing or Strongly
Agreeing With Each Possible Disadvantage of
Peer Assessment of Writing (With Standard Error
Bars)

Survey items shown (axis: % Agreement):

I didn't think it was fair that our peer feedback
to other students was part of our grade

I thought reviewing five essays was an
unreasonable amount of work

I prefer to get feedback only from my teacher

I prefer to talk face-to-face with my peers for
peer review

I didn't like giving advice to my peers about
their writing

I didn't like having peers grade my essay

The teacher survey also probed the perceived benefits
and disadvantages of online peer assessment. In
general, all but one of the teachers agreed that students
learned from the peer review activity and gave helpful
feedback to their peers. Like students, teachers perceived
the benefits of giving feedback as even greater
than receiving feedback. All but one teacher felt that
their students were better prepared for the AP exam as a
result of only one round of peer assessment.

Finally, in terms of the feasibility of using online peer
assessment, all but one teacher felt that this was a convenient
way for students to receive detailed feedback on
their writing, and 80% of the teachers felt that they could
easily implement this kind of peer assessment on a regular
basis.

Only one teacher did not want to use online peer assessment
in the next year. Of all the teachers, 35% wanted
to do so a few times per year, and almost 60% wanted to
use online peer assessment regularly throughout the
year. Overall, teachers generally saw that the benefits
outweighed the difficulties and that online peer assessment
of this form significantly adds value over more informal
peer feedback that can be done face to face in class.


Conclusion 

Can high schoolers reliably assess their peers’ academic
writing? The results of our study suggest yes. Given
teachers’ and students’ concerns about the validity
and reliability of peer assessment of writing, this study
demonstrated that when using a carefully designed rubric,
high school students were able to provide ratings
that were more valid than the ones provided by a single
teacher and just as valid as the ones provided by expert
AP scorers. Further, most students perceived their
peers’ feedback as helpful.

Across school contexts (urban, suburban, and socioeconomic
status levels) and student writing achievement
levels, students’ ratings were acceptably correlated with
teachers’ and experts’, and students were able to consistently
rate the higher level aspects of academic writing,
such as explaining evidence. Thus, it does not seem to be
the case that only the most capable students are able to
participate meaningfully in peer assessment activities.
These positive results of peer assessment across contexts
mirror the findings of prior research examining
the peer reviews of stronger and weaker students
(Patchan, Hawk, Stevens, & Schunn, 2013). Finally, both
teachers and students perceived many benefits of participating
in peer assessment, and most participating
teachers believed that online peer assessment should be
used regularly in their classes.

Our study has a number of limitations. First, it was
limited to one approach to peer review that included
three key features: anonymity, a student-friendly rubric,
and accountability for the quality of feedback given to
peers. It is possible that different results would be obtained
from another approach to peer review, whether
online or face to face. Second, we focused on a writing
task that was familiar to the AP students: an analysis of
rhetorical strategies. Although the rubric used for peer
assessment was new to the students, it is likely that the
familiarity of the task increased their understanding of
the expectations and, thus, the reliability and validity of
their peer assessments.

Third, the study involved only AP students, who tend
to be in 11th and 12th grades and have developed relatively
stronger academic literacy skills than other high
school students. However, in recent years, there has
been a nationwide movement to enroll younger students
with a wider range of academic preparedness in AP
classes (Godley, Monroe, & Castma, 2015). Additionally,
our other studies have demonstrated the benefit of peer
review in diverse, non-AP high school classes (Godley
et al., 2014).

Fourth, our study did not tackle the issue of how to
convince students that peer-generated grades are fair
and can be used in place of teacher grading. As the research
on peer assessment at the college level has shown,
students’ perceptions of fairness are a significant
challenge to the use of peer assessment in determining
task or course grades (Kaufman & Schunn, 2011).
Finally, some students were unhappy with the workload
required to review five peers’ essays. Mathematically,
fewer peer assessments per essay, such as three or four,
would generate mean essay scores with slightly lower validity;
however, it is expected that four peer assessments
per essay would still produce ratings as valid as those
produced by a single teacher.

TAKE ACTION!

These steps can be done without technology, but
online tools simplify the logistics:

1. Find online peer review tools that are available for
your school or district.

2. Create a rubric with student-friendly language.

3. Create guidelines for how students should participate
in the assessment process.

4. Teach students about good peer review. Allow them
time to practice assessing a shared piece of writing in
pairs or small groups.

5. Anonymity may be achieved by having students
select pseudonyms or assigning students numbers
that only you and they know and that they will put on
each document used during peer review, keeping the
writers' and reviewers' identities a secret.

6. Assign three or four peers to review each student's
paper. Research has shown that students learn as
much or more from giving reviews as from receiving
them.

7. Before delivering the peer feedback to students, scan
it for common issues and average reviewers' ratings.

8. In class, discuss papers that received conflicting
reviews.

9. Allow students time to reflect on the process of peer
review before asking them to revise. This will give
students time to consider their own understanding
of the rubric and synthesize feedback from multiple
peers before revising.
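
For teachers without access to an online tool, the logistics in Take Action steps 5 through 8 (numbered identities, three or four reviewers per paper, averaged ratings, and spotting conflicting reviews) could be handled with a short script. The following is a minimal sketch under stated assumptions, not a description of how Peerceptiv or any other tool works. The function names, the student numbers, and the 3-point spread used to flag conflicting reviews are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

def assign_reviewers(student_numbers, reviews_per_paper=4, seed=None):
    """Map each writer's number to the numbers of the peers who review it.

    Uses a shuffled circular shift: each paper goes to the next few students
    in a random ordering, so nobody reviews their own paper and everyone
    reviews the same number of papers.
    """
    ids = list(student_numbers)
    if reviews_per_paper >= len(ids):
        raise ValueError("need more students than reviews per paper")
    random.Random(seed).shuffle(ids)
    n = len(ids)
    return {
        ids[i]: [ids[(i + offset) % n] for offset in range(1, reviews_per_paper + 1)]
        for i in range(n)
    }

def average_ratings(ratings_by_paper):
    """Average the reviewers' ratings for each paper (step 7)."""
    return {paper: round(mean(scores), 2) for paper, scores in ratings_by_paper.items()}

def flag_conflicting(ratings_by_paper, spread=3):
    """List papers whose highest and lowest ratings differ by `spread` or more (step 8)."""
    return [paper for paper, scores in ratings_by_paper.items()
            if max(scores) - min(scores) >= spread]

# Hypothetical class of eight students, identified only by assigned numbers.
numbers = [101, 102, 103, 104, 105, 106, 107, 108]
assignments = assign_reviewers(numbers, reviews_per_paper=4, seed=1)

# Hypothetical ratings on a 1-7 scale for two papers.
ratings = {101: [5, 6, 5, 6], 102: [2, 6, 5, 3]}
print(average_ratings(ratings))
print(flag_conflicting(ratings))
```

The circular shift is only one of several reasonable ways to distribute papers. Its advantage is that the reviewing workload is identical for every student.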

Despite these limitations, our study suggests that
carefully designed peer review tasks can provide students
with helpful feedback on and fair assessments of
their writing without increasing English language arts
teachers’ workload. We believe that peer assessment can
be a feasible and productive solution to the challenge of
asking high school students to write and revise more
often without increasing English teachers’ paper load.
Given the evidence that students learn from peer review
and from using rubrics, peer assessment can enhance
writing instruction in high schools.

NOTES 

This study was funded by The College Board and by the 
Institute of Education Sciences, U.S. Department of Education 
(R305A120370). The first author has a significant financial 
interest in the company that makes Peerceptiv available. 
MORE TO EXPLORE


Resources for online peer review: 

■ Peerceptiv: www.peerceptiv.com 

■ Calibrated Peer Review: cpr.molsci.ucla.edu/cpr/cpr_info/index.asp

■ Eli Review: https://app.elireview.com 

■ PeerMark (a feature of Turnitin):
https://guides.turnitin.com/01_Manuals_and_Guides/Student/Student_User_Manual/19_PeerMark

Resources for using rubrics: 

■ Edutopia's "Resources for Using Rubrics in
the Middle Grades": www.edutopia.org/rubrics-middle-school-resources

Print resources: 

■ Andrade, H.G. (2000). Using rubrics to promote 
thinking and learning. Educational Leadership, 57(5), 
13-18. 

■ Topping, K.J. (2009). Peer assessment. Theory Into 
Practice, 48(1), 20-27. 